Abstract

We study the problem of online regression. We do not make any assumptions 
about input vectors or outcomes. We prove a theoretical bound on the square 
loss of Ridge Regression. We also show that Bayesian Ridge Regression can be 
thought of as an online algorithm competing with all the Gaussian linear experts. 
We then consider the case of infinite-dimensional Hilbert spaces and prove relative 
loss bounds for the popular non-parametric kernelized Bayesian Ridge Regression 
and kernelized Ridge Regression. Our main theoretical guarantees have the form 
of equalities. 



1 Introduction 

In the online prediction framework we are provided with some input at each step
and try to predict an outcome using this input and information from previous steps
(Cesa-Bianchi and Lugosi, 2006). In a simple case in statistics, it is assumed that each
outcome is the value, corrupted by Gaussian noise, of a linear function of the input.

In competitive prediction the learner compares his loss at each step with the loss 
of any expert from a certain class of experts instead of making statistical assumptions 
about the data generating process. Experts may follow certain strategies. The learner 
wishes to predict almost as well as the best expert for all sequences. 

Our main result is Theorem 1 in the next section, which compares the cumulative
weighted square loss of Ridge Regression applied in the on-line mode with the
regularized cumulative loss of the best linear predictor. The power of this result can
be best appreciated by looking at the range of its implications, both known and new.
For example, Corollary 1 answers the question asked by several researchers, see Vovk
(2001), whether Ridge Regression has a relative loss bound with the regret term of the
order $\ln T$ under the square loss function, where $T$ is the number of steps and the
outcomes are assumed bounded; this corollary (as well as all other implications stated in
Section 2) is an explicit inequality rather than an asymptotic result. Theorem 1 itself
is much stronger, stating an equality rather than an inequality and not assuming that the
outcomes are bounded. Since it is an equality, it unites upper and lower bounds on the
loss. It appears that all natural bounds on the square loss of Ridge Regression can be
easily deduced from our theorem; we give some examples in the next section.

Most previous research in online prediction considers experts that disregard the
presence of noise in observations. We consider experts predicting a distribution on
the outcomes. We use Bayesian Ridge Regression and prove that it can predict as
well as the best regularized expert; this is our Theorem 2. The loss in this theoretical
guarantee is the logarithmic loss. The algorithm that we apply was first used by
DeSantis et al. (1988), and similar bounds to ours were obtained by Kakade and Ng
(2004) and Kakade et al. (2005). Theorem 2 is later used to deduce Theorem 1. Ridge
Regression predicts the mean of the Bayesian Ridge Regression predictive distribution,
and the logarithmic loss of Bayesian Ridge Regression is close to the scaled square loss of
Ridge Regression.

We extend our main result to the case of infinite-dimensional Hilbert spaces of
functions. The algorithm used becomes an analogue of non-parametric Bayesian methods.
From Theorem 2 and Theorem 1 we deduce relative loss bounds on the logarithmic loss
of kernelized Bayesian Ridge Regression and on the square loss of kernelized Ridge
Regression in comparison with the loss of any function from a reproducing kernel
Hilbert space. Both bounds have the form of equalities.

Much research has been devoted to proving upper and lower relative loss bounds
under different loss functions. If the outcomes are assumed to be bounded, the strongest
known theoretical guarantees for square loss are given by Vovk (2001) and Azoury and
Warmuth (2001) for the algorithm which we call VAW (Vovk-Azoury-Warmuth) following
Cesa-Bianchi and Lugosi (2006). In the case when the inputs and outcomes are not
restricted in any way, like for our main guarantees, it is possible to prove certain loss
bounds for Gradient Descent; see Cesa-Bianchi et al. (1996).

In Section 2 of this paper we present the online regression framework and the main
theoretical guarantee on the square loss of Ridge Regression. Section 3 describes what
we call the Bayesian Algorithm. In Section 4 we show that Bayesian Ridge Regression
is competitive with the experts which take into account the presence of noise in
observations. In Section 5 we prove the main theorem. Section 6 describes the case of
infinite-dimensional Hilbert spaces.



2 The prediction protocol and performance guarantees 

In online regression the learner follows this prediction protocol: 



Protocol 1 Online regression protocol
for t = 1, 2, . . . do
    Reality announces $x_t \in \mathbb{R}^n$
    Learner predicts $\gamma_t \in \mathbb{R}$
    Reality announces $y_t \in \mathbb{R}$
end for



We use the Ridge Regression algorithm for the learner: 
Algorithm 1 Online Ridge Regression
Require: $a > 0$
Initialize $b_0 = 0 \in \mathbb{R}^n$, $A_0 = aI \in \mathbb{R}^{n \times n}$
for t = 1, 2, . . . do
    Read $x_t \in \mathbb{R}^n$
    Predict $\gamma_t = b_{t-1}' A_{t-1}^{-1} x_t$
    Read $y_t$
    Update $A_t = A_{t-1} + x_t x_t'$
    Update $b_t = b_{t-1} + y_t x_t$
end for



Following this algorithm the learner's prediction at step $T$ can be written as

$$ \gamma_T = \left( \sum_{t=1}^{T-1} y_t x_t \right)' \left( aI + \sum_{t=1}^{T-1} x_t x_t' \right)^{-1} x_T. $$

The incremental update of the matrix $A_t^{-1}$ can be done efficiently by the
Sherman-Morrison formula. We prove the following theoretical guarantee for the square
loss of the learner following Ridge Regression.
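Algorithm 1 with the Sherman-Morrison update can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation; the function name `online_ridge` and the list-of-lists matrix representation are our own assumptions.

```python
def online_ridge(xs, ys, a):
    """Run Algorithm 1, keeping A_t^{-1} up to date by Sherman-Morrison.

    xs is a list of input vectors (lists of floats), ys a list of outcomes.
    Returns the predictions gamma_t and the values x_t' A_{t-1}^{-1} x_t.
    """
    n = len(xs[0])
    # A_0^{-1} = (a I)^{-1}
    A_inv = [[(1.0 / a if i == j else 0.0) for j in range(n)] for i in range(n)]
    b = [0.0] * n
    preds, quads = [], []
    for x, y in zip(xs, ys):
        # u = A_{t-1}^{-1} x_t
        u = [sum(A_inv[i][j] * x[j] for j in range(n)) for i in range(n)]
        q = sum(x[i] * u[i] for i in range(n))            # x_t' A_{t-1}^{-1} x_t
        preds.append(sum(b[i] * u[i] for i in range(n)))  # b_{t-1}' A_{t-1}^{-1} x_t
        quads.append(q)
        # Sherman-Morrison: (A + x x')^{-1} = A^{-1} - u u' / (1 + x' A^{-1} x),
        # which uses the symmetry of A^{-1}.
        for i in range(n):
            for j in range(n):
                A_inv[i][j] -= u[i] * u[j] / (1.0 + q)
        b = [b[i] + y * x[i] for i in range(n)]
    return preds, quads


preds, quads = online_ridge([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0], 1.0)
```

Each step costs $O(n^2)$ operations, compared with $O(n^3)$ for inverting $A_{t-1}$ from scratch.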

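Theorem 1. The Ridge Regression algorithm for the learner with $a > 0$ satisfies, at
any step $T$,

$$ \sum_{t=1}^{T} \frac{(y_t - \gamma_t)^2}{1 + x_t' A_{t-1}^{-1} x_t} = \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^{T} (y_t - \theta' x_t)^2 + a \|\theta\|^2 \right). \tag{1} $$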
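The equality of Theorem 1 is easy to check numerically. The sketch below does so for scalar inputs ($n = 1$), where $A_t$ and $b_t$ are numbers and the minimum over $\theta$ has the closed form $\sum_t y_t^2 - b_T^2 / A_T$, attained at $\theta = b_T / A_T$. The function name `theorem1_sides` and the test data are illustrative assumptions.

```python
def theorem1_sides(xs, ys, a):
    """Return (left-hand side, right-hand side) of identity (1) for n = 1."""
    A, b = a, 0.0                       # A_0 = a, b_0 = 0
    lhs = 0.0
    for x, y in zip(xs, ys):
        gamma = b * x / A               # prediction b_{t-1} A_{t-1}^{-1} x_t
        lhs += (y - gamma) ** 2 / (1.0 + x * x / A)
        A += x * x                      # A_t = A_{t-1} + x_t^2
        b += y * x                      # b_t = b_{t-1} + y_t x_t
    # min over theta of sum (y_t - theta x_t)^2 + a theta^2,
    # attained at theta = b_T / A_T
    rhs = sum(y * y for y in ys) - b * b / A
    return lhs, rhs


lhs, rhs = theorem1_sides([1.0, 2.0], [1.0, 1.0], 1.0)  # both sides equal 0.5
```

In this example the first step contributes $1/2$ (prediction $0$, outcome $1$, denominator $2$) and the second step contributes $0$ (prediction $1$, outcome $1$), while the right-hand side is $2 - 9/6 = 1/2$.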



Note that the part $x_t' A_{t-1}^{-1} x_t$ in the denominator is usually close to zero for large $t$.
An equivalent equality is also obtained (but well hidden) in the proof of Theorem 4.6
in Azoury and Warmuth (2001). Our proof is more elegant. We describe it from the
point of view of online prediction, but we note the connection with Bayesian learning
in derivations. We obtain an upper bound in the form which is more familiar from
online prediction literature.

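Corollary 1. Assume $|y_t| \le Y$ for all $t$, clip the predictions of Ridge Regression to
$[-Y, Y]$, and denote them by $\gamma_t^Y$. Then

$$ \sum_{t=1}^{T} (y_t - \gamma_t^Y)^2 \le \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^{T} (y_t - \theta' x_t)^2 + a \|\theta\|^2 \right) + 4Y^2 \ln\det\left( I + \frac{1}{a} \sum_{t=1}^{T} x_t x_t' \right). \tag{2} $$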

Proof. We first clip the predictions of Ridge Regression to $[-Y, Y]$ in Theorem 1. In
this case the loss at each step can only become smaller, and so the equality transforms
to an inequality. Since all the outcomes also lie in $[-Y, Y]$, the maximum square loss
at each step is $4Y^2$. We have the following relations:

$$ \frac{1}{1 + x_t' A_{t-1}^{-1} x_t} = 1 - \frac{x_t' A_{t-1}^{-1} x_t}{1 + x_t' A_{t-1}^{-1} x_t} \quad \text{and} \quad \frac{x_t' A_{t-1}^{-1} x_t}{1 + x_t' A_{t-1}^{-1} x_t} \le \ln(1 + x_t' A_{t-1}^{-1} x_t). $$

The last inequality holds because $x_t' A_{t-1}^{-1} x_t$ is non-negative due to the positive
definiteness of the matrix $A_{t-1}$. Thus we can use $\frac{b}{1+b} \le \ln(1 + b)$, $b \ge 0$ (it holds at $b = 0$,
then take the derivatives of both sides). For the equality
$\sum_{t=1}^{T} \ln(1 + x_t' A_{t-1}^{-1} x_t) = \ln\det\left( I + \frac{1}{a} \sum_{t=1}^{T} x_t x_t' \right)$ see (16).



□ 
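The elementary inequality $\frac{b}{1+b} \le \ln(1+b)$ for $b \ge 0$ used in the proof can be spelled out as follows; the auxiliary function $h$ is our notation and does not appear elsewhere in the paper.

```latex
h(b) = \ln(1+b) - \frac{b}{1+b}, \qquad h(0) = 0, \qquad
h'(b) = \frac{1}{1+b} - \frac{1}{(1+b)^2} = \frac{b}{(1+b)^2} \ge 0 \quad (b \ge 0),
```

so $h$ is non-decreasing on $[0, \infty)$ and hence $h(b) \ge 0$ there.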



The bound (2) is exactly the bound obtained in Theorem 4 in Vovk (2001) for the
algorithm merging linear experts with predictions clipped to $[-Y, Y]$, which does not
have a closed-form description and so is less interesting than clipped Ridge Regression.
The bound for the VAW algorithm obtained in Theorem 1 in Vovk (2001) has $Y^2$ in
place of $4Y^2$ (the VAW algorithm is very similar to Ridge Regression; its predictions
are $b_{t-1}' A_t^{-1} x_t$ rather than $b_{t-1}' A_{t-1}^{-1} x_t$). The regret term in (2) has the logarithmic
order in $T$ if $\|x_t\|_\infty \le X$ for all $t$, because

$$ \ln\det\left( I + \frac{1}{a} \sum_{t=1}^{T} x_t x_t' \right) \le n \ln\left( 1 + \frac{T X^2}{a} \right) \tag{3} $$

(the determinant of a positive definite matrix is bounded by the product of its diagonal
elements; see Chapter 2, Theorem 7 of Beckenbach and Bellman (1961)). This bound
is also obtained in Theorem 4.6 in Azoury and Warmuth (2001).



From our Theorem 1 we can also deduce Theorem 11.7 of Cesa-Bianchi and Lugosi
(2006), which is somewhat similar to our corollary. That theorem implies (2) when
Ridge Regression's predictions happen to be in $[-Y, Y]$ without clipping (but this is
not what Corollary 1 asserts).

The upper bound (2) does not hold if the coefficient 4 is replaced by any number
less than $\frac{3}{2 \ln 2} \approx 2.164$, as can be seen from an example given in Theorem 3 in Vovk
(2001), where the left-hand side of (2) is $4T + o(T)$, the minimum in the right-hand
side is at most $T$, $Y = 1$, and the logarithm is $2T \ln 2 + O(1)$. It is also known that
there is no algorithm achieving (2) with the coefficient less than 1 instead of 4 even in
the case where $\|x_t\|_\infty \le X$ for all $t$; see Theorem 2 in Vovk (2001).

It is also possible to prove an upper bound without the logarithmic part on the 
cumulative square loss of Ridge Regression without assuming that the outcomes are 
bounded. 

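Corollary 2. If $\|x_t\|_2 \le Z$ for all $t$ then the Ridge Regression algorithm for the learner
with $a > 0$ satisfies, at any step $T$,

$$ \sum_{t=1}^{T} (y_t - \gamma_t)^2 \le \left( 1 + \frac{Z^2}{a} \right) \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^{T} (y_t - \theta' x_t)^2 + a \|\theta\|^2 \right). \tag{4} $$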



Proof. Qazaz et al. (1997) showed that $1 + x_t' A_j^{-1} x_t \le 1 + x_t' A_i^{-1} x_t$ for $j \ge i$. We
take $i = 0$ and obtain $1 + x_t' A_{t-1}^{-1} x_t \le 1 + Z^2/a$ for any $t$.
□



This bound is better than the bound in Corollary 3.1 of Kakade and Ng (2004),
which has an additional regret term of logarithmic order in time.

Asymptotic properties of the Ridge Regression algorithm can be further studied
using Corollary A.1 in Kumon et al. (2009). It states that when $\|x_t\|_2 \le 1$ for all $t$,
then $x_t' A_{t-1}^{-1} x_t \to 0$ as $t \to \infty$. It is clear that we can replace $\|x_t\|_2 \le 1$ for all $t$ by
$\sup_t \|x_t\|_2 < \infty$. The following corollary states that if there exists a very good expert
(asymptotically), then Ridge Regression also predicts very well. If there is no such
good expert, Ridge Regression performs asymptotically as well as the best regularized
expert.

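Corollary 3. Let $a > 0$ and $\gamma_t$ be the predictions output by the Ridge Regression
algorithm with parameter $a$. Suppose $\sup_t \|x_t\|_2 < \infty$.

1. If

$$ \exists \theta \in \mathbb{R}^n : \sum_{t=1}^{\infty} (y_t - \theta' x_t)^2 < \infty, \tag{5} $$

then

$$ \sum_{t=1}^{\infty} (y_t - \gamma_t)^2 < \infty. $$

2. If

$$ \forall \theta \in \mathbb{R}^n : \sum_{t=1}^{\infty} (y_t - \theta' x_t)^2 = \infty, \tag{6} $$

then

$$ \lim_{T \to \infty} \frac{\sum_{t=1}^{T} (y_t - \gamma_t)^2}{\min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^{T} (y_t - \theta' x_t)^2 + a \|\theta\|^2 \right)} = 1. \tag{7} $$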



Proof. Part 1. Suppose that the condition (5) holds. Then the right-hand side of (1) is
bounded by a constant (independent of $T$). By Corollary A.1 in Kumon et al. (2009)
the denominators in the left-hand side converge to 1 as $t \to \infty$ and so are bounded.
Therefore, the sequence $\sum_{t=1}^{T} (y_t - \gamma_t)^2$ remains bounded as $T \to \infty$.

Part 2. Suppose that the condition (6) holds and the right-hand side of (1) is
bounded above by a constant $C$. Then for each $T$ there exists $\theta_T$ such that

$$ \sum_{t=1}^{T} (y_t - \theta_T' x_t)^2 + a \|\theta_T\|^2 \le C. $$

It follows that each $\theta_T$ belongs to the closed ball with centre 0 and of radius $\sqrt{C/a}$.
This ball is a compact set, and thus the sequence $\theta_T$ has a subsequence that converges
to some $\theta$. For each $T_0$ we have $\sum_{t=1}^{T_0} (y_t - \theta' x_t)^2 \le C$, because otherwise we would
have $\sum_{t=1}^{T_0} (y_t - \theta_T' x_t)^2 > C$ for a large enough $T$ in the subsequence. Therefore, we
have arrived at a contradiction: $\sum_{t=1}^{\infty} (y_t - \theta' x_t)^2 \le C < \infty$.

Once we know that the right-hand side of (1) tends to $\infty$ as $T \to \infty$ and the
denominators on the left-hand side tend to 1 (this is true by Corollary A.1 in Kumon et al.,
2009), (7) becomes intuitively plausible since, as far as the conclusion (7) is concerned,
we can ignore the finite number of $t$s for which the denominator $1 + x_t' A_{t-1}^{-1} x_t$ is
significantly different from 1. We will, however, give a formal argument.
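The inequality $\ge 1$ in (7) is clear from (1) and $1 + x_t' A_{t-1}^{-1} x_t \ge 1$. We shall prove
the inequality $\le 1$ now. Choose a small $\epsilon > 0$. Then starting from some $t = T_0$ we
have that the denominators $1 + x_t' A_{t-1}^{-1} x_t$ are less than $1 + \epsilon$. Thus, for $T > T_0$,

$$ \begin{aligned}
\sum_{t=1}^{T} (y_t - \gamma_t)^2 &= \sum_{t=1}^{T_0} (y_t - \gamma_t)^2 + \sum_{t=T_0+1}^{T} (y_t - \gamma_t)^2 \\
&\le \sum_{t=1}^{T_0} (y_t - \gamma_t)^2 + (1 + \epsilon) \sum_{t=1}^{T} \frac{(y_t - \gamma_t)^2}{1 + x_t' A_{t-1}^{-1} x_t} \\
&= \sum_{t=1}^{T_0} (y_t - \gamma_t)^2 + (1 + \epsilon) \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^{T} (y_t - \theta' x_t)^2 + a \|\theta\|^2 \right).
\end{aligned} $$

This implies that the left-hand side of (7) with lim replaced by lim sup does not exceed
$1 + \epsilon$, and it remains to remember that $\epsilon$ can be taken arbitrarily small. □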



3 Bayesian algorithm 

In this section we describe the main algorithm used to prove our theoretical bounds.
Let us denote the set of possible outcomes by $\Omega$, the index set for the experts by $\Theta$, and
the set of allowed predictions by $\Gamma$. The quality of predictions is measured by a loss
function $\lambda : \Gamma \times \Omega \to \mathbb{R}$. We have $\Omega = \mathbb{R}$, $\Theta = \mathbb{R}^n$, and $\Gamma$ is the set of all measurable
functions on the real line integrable to one. The loss function $\lambda$ is the logarithmic loss
$\lambda(\gamma, y) = -\ln \gamma(y)$, where $\gamma \in \Gamma$ and $y \in \Omega$. The learner follows the prediction with
expert advice protocol.



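Protocol 2 Prediction with expert advice protocol
Initialize $L_0 := 0$ and $L_0(\theta) = 0$, $\forall \theta \in \Theta$
for t = 1, 2, . . . do
    Experts $\theta \in \Theta$ announce their predictions $\xi_t^\theta \in \Gamma$
    Learner predicts $\gamma_t \in \Gamma$
    Reality announces $y_t \in \Omega$
    Losses are updated: $L_t = L_{t-1} + \lambda(\gamma_t, y_t)$, $L_t(\theta) = L_{t-1}(\theta) + \lambda(\xi_t^\theta, y_t)$, $\forall \theta \in \Theta$
end for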



Here by $L_T$ we denote the cumulative loss of the learner at step $T$, and by $L_T(\theta)$ we
denote the cumulative loss of the expert $\theta$ at this step.

We use a standard algorithm in prediction with expert advice (a special case of
the Aggregating Algorithm for the logarithmic loss function and learning rate 1, going
back to DeSantis et al. (1988) in the case of countable $\Theta$ and $\Omega$) to derive the main
theoretical bound and give predictions. We call it the Bayesian Algorithm (BA) as it
is virtually identical to the Bayes rule used in Bayesian learning (the main difference
being that the experts are not required to follow any prediction strategies). Instead of
looking for the best expert, the algorithm considers all the experts and takes a weighted
average of their predictions as its own prediction. In detail, it works as follows.



Algorithm 2 Bayesian Algorithm
Require: A probability measure $P_0(d\theta) = P_0^*(d\theta)$ on $\Theta$ (the prior distribution, or
weights)
for t = 1, 2, . . . do
    Read experts' predictions $\xi_t^\theta \in \Gamma$, $\forall \theta \in \Theta$
    Predict $g_t = \int_\Theta \xi_t^\theta P_{t-1}^*(d\theta)$
    Read $y_t$
    Update the weights $P_t(d\theta) = \xi_t^\theta(y_t) P_{t-1}(d\theta)$
    Normalize the weights $P_t^*(d\theta) = P_t(d\theta) / \int_\Theta P_t(d\theta)$
end for



The experts' weights are updated according to their losses at each step:
$\xi_t^\theta(y_t) = e^{-\lambda(\xi_t^\theta, y_t)}$; larger losses lead to smaller weights. After $t$ steps the weights become

$$ P_t(d\theta) = e^{-L_t(\theta)} P_0(d\theta). \tag{8} $$

The normalized weights $P_T^*(d\theta)$ correspond to the posterior distribution over $\theta$ after
the step $T$. As we said, the prediction of the BA at step $T$ is given by the average

$$ g_T = \int_\Theta \xi_T^\theta P_{T-1}^*(d\theta) \tag{9} $$

of the experts' predictions.

The next lemma is a special case of Lemma 1 in Vovk (2001). It shows that the
cumulative loss of the BA is an average of the experts' cumulative losses in a generalized
sense, as in, e.g., Chapter 3 of Hardy et al. (1952).

Lemma 1. For any prior $P_0$ and any $T = 1, 2, \ldots$, the cumulative loss of the BA can
be expressed as

$$ L_T = -\ln \int_\Theta e^{-L_T(\theta)} P_0(d\theta). \tag{10} $$



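Proof. We proceed by induction in $T$: for $T = 0$ the equality is obvious, and for $T > 0$
we have:

$$ \begin{aligned}
L_T &= L_{T-1} - \ln g_T(y_T) \\
&= -\ln \int_\Theta e^{-L_{T-1}(\theta)} P_0(d\theta) - \ln \int_\Theta \xi_T^\theta(y_T) \frac{e^{-L_{T-1}(\theta)} P_0(d\theta)}{\int_\Theta e^{-L_{T-1}(\theta)} P_0(d\theta)} \\
&= -\ln \int_\Theta e^{-L_T(\theta)} P_0(d\theta)
\end{aligned} $$

(the second equality follows from the inductive assumption, the definition of $g_T$, and
(8)). □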
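For a finite set of experts the integrals become sums, and both the Bayesian Algorithm and the identity of Lemma 1 can be checked directly. The sketch below assumes scalar inputs and Gaussian experts with means $\theta x_t$ and a common variance; the function names and the choice of a three-expert pool are illustrative assumptions, not part of the paper.

```python
import math


def gaussian_density(mean, var, y):
    """Density of N(mean, var) at the point y."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def bayesian_algorithm(thetas, prior, xs, ys, var=1.0):
    """Run the BA over finitely many experts under the logarithmic loss.

    Returns the learner's cumulative loss and the experts' cumulative losses.
    """
    weights = list(prior)                     # unnormalized weights P_t
    expert_loss = [0.0] * len(thetas)
    learner_loss = 0.0
    for x, y in zip(xs, ys):
        total = sum(weights)
        # xi_t^theta(y_t) for each expert
        dens = [gaussian_density(th * x, var, y) for th in thetas]
        # g_t(y_t): mixture of the experts' densities under P*_{t-1}
        g = sum(w / total * d for w, d in zip(weights, dens))
        learner_loss += -math.log(g)
        expert_loss = [l - math.log(d) for l, d in zip(expert_loss, dens)]
        weights = [w * d for w, d in zip(weights, dens)]
    return learner_loss, expert_loss


prior = [1.0 / 3] * 3
loss, losses = bayesian_algorithm([-1.0, 0.0, 1.0], prior, [1.0, 2.0, 0.5], [1.0, 1.5, 0.2])
# Lemma 1: the learner's loss equals -ln of the prior-weighted sum of exp(-L_T(theta))
mix_loss = -math.log(sum(p * math.exp(-l) for p, l in zip(prior, losses)))
```

Up to floating-point rounding, `loss` and `mix_loss` coincide. For long sequences the unnormalized weights underflow, so a practical implementation would work with logarithms of the weights.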
4 Bayesian Ridge Regression as a competitive algorithm



Let us consider experts whose predictions at step $t$ are the densities of the normal
distributions $N(\theta' x_t, \sigma^2)$ on the set of outcomes for some fixed variance $\sigma^2 > 0$ (so
each expert $\theta$ follows a fixed strategy). From the statistical point of view, they predict
according to the model $y_t = \theta' x_t + \epsilon_t$ with Gaussian noise $\epsilon_t \sim N(0, \sigma^2)$. In other
words, the prediction of each expert $\theta \in \Theta$ is

$$ \xi_t^\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(\theta' x_t - y)^2}{2\sigma^2}}. \tag{11} $$

Let us take the initial distribution $N(0, \frac{\sigma^2}{a} I)$ on the experts with some $a > 0$:

$$ P_0(d\theta) = \left( \frac{a}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{a}{2\sigma^2} \|\theta\|^2 \right) d\theta. $$

We will prove that in this setting the prediction of the Bayesian Algorithm is equal
to the prediction of Bayesian Ridge Regression. But first we need to introduce some
notation. For $t \in \{1, 2, \ldots\}$, let $X_t$ be the $t \times n$ matrix of row vectors $x_1', \ldots, x_t'$ and $Y_t$
be the column vector of outcomes $y_1, \ldots, y_t$. Let $A_t = X_t' X_t + aI$, as before. Bayesian
Ridge Regression is the algorithm predicting at each step $T$ the normal distribution
$N(\gamma_T, \sigma_T^2)$ with the mean and variance given by

$$ \gamma_T = Y_{T-1}' X_{T-1} A_{T-1}^{-1} x_T, \qquad \sigma_T^2 = \sigma^2 x_T' A_{T-1}^{-1} x_T + \sigma^2 \tag{12} $$

for some $a > 0$ and the known noise variance $\sigma^2$.

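Lemma 2. In our setting the prediction (9) of the Bayesian Algorithm is the prediction
density of Bayesian Ridge Regression in the notation of (12):

$$ g_T(y) = \frac{1}{\sqrt{2\pi\sigma_T^2}} e^{-\frac{(y - \gamma_T)^2}{2\sigma_T^2}}. \tag{13} $$

Proof. The prediction

$$ g_T(y) = \int_\Theta \xi_T^\theta(y) P_{T-1}^*(d\theta) = \frac{\int_\Theta \xi_T^\theta(y) \prod_{t=1}^{T-1} \xi_t^\theta(y_t) P_0(d\theta)}{\int_\Theta \prod_{t=1}^{T-1} \xi_t^\theta(y_t) P_0(d\theta)} $$

formally coincides with the density of the predictive distribution of the Bayesian Gaussian
linear model, and so equality (13) is true: see Section 3.3.2 of Bishop (2006). □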

Remark 1. From the probabilistic point of view Lemma 2 is usually explained in
the following way (Hoerl and Kennard, 2000). The posterior distribution $P_{T-1}^*(\theta)$ is
$N(A_{T-1}^{-1} X_{T-1}' Y_{T-1}, \sigma^2 A_{T-1}^{-1})$. The conditional distribution of $\theta' x_T$ given the
training examples is then $N(Y_{T-1}' X_{T-1} A_{T-1}^{-1} x_T, \sigma^2 x_T' A_{T-1}^{-1} x_T)$, and so the predictive
distribution is $N(Y_{T-1}' X_{T-1} A_{T-1}^{-1} x_T, \sigma^2 x_T' A_{T-1}^{-1} x_T + \sigma^2)$.

For the subsequent derivations, we will need the following well-known lemma,
whose proof can be found in Lemma 8 of Busuttil (2008) or extracted from Chapter 2,
Theorem 3 of Beckenbach and Bellman (1961).



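Lemma 3. Let $W(\theta) = \theta' A \theta + b' \theta + c$ for $\theta, b \in \mathbb{R}^n$, $c$ be a scalar, and $A$ be a
symmetric positive definite $n \times n$ matrix. Then

$$ \int_{\mathbb{R}^n} e^{-W(\theta)} d\theta = e^{-W_0} \frac{\pi^{n/2}}{\sqrt{\det A}}, $$

where $W_0 = \min_\theta W(\theta)$.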

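The right-hand side of (10) can be transformed to the regularized cumulative loss
of the best expert $\theta$ and a regret term:

Theorem 2. For any sequence $x_1, y_1, x_2, y_2, \ldots$, the cumulative logarithmic loss of
the Bayesian Ridge Regression algorithm (13) at any step $T$ can be expressed as

$$ L_T = \min_{\theta \in \mathbb{R}^n} \left( L_T(\theta) + \frac{a}{2\sigma^2} \|\theta\|^2 \right) + \frac{1}{2} \ln\det\left( I + \frac{1}{a} \sum_{t=1}^{T} x_t x_t' \right). \tag{14} $$

If $\|x_t\|_\infty \le X$ for any $t = 1, 2, \ldots$, then

$$ L_T \le \min_{\theta \in \mathbb{R}^n} \left( L_T(\theta) + \frac{a}{2\sigma^2} \|\theta\|^2 \right) + \frac{n}{2} \ln\left( 1 + \frac{T X^2}{a} \right). \tag{15} $$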
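Proof. We have to calculate the right-hand side of (10). The integral is expressed as

$$ \int_{\mathbb{R}^n} \frac{1}{(2\pi\sigma^2)^{T/2}} \left( \frac{a}{2\pi\sigma^2} \right)^{n/2} e^{-\frac{1}{2\sigma^2} \left( \sum_{t=1}^{T} (y_t - \theta' x_t)^2 + a \|\theta\|^2 \right)} d\theta. $$

By Lemma 3 it is equal to

$$ \frac{1}{(2\pi\sigma^2)^{T/2}} \left( \frac{a}{2\pi\sigma^2} \right)^{n/2} e^{-\frac{1}{2\sigma^2} \left( \sum_{t=1}^{T} (y_t - \theta_0' x_t)^2 + a \|\theta_0\|^2 \right)} \frac{\pi^{n/2}}{\sqrt{\det A_T}}, $$

where $A_T$ is the coefficient matrix in the quadratic part, $A_T = \frac{1}{2\sigma^2} \left( aI + \sum_{t=1}^{T} x_t x_t' \right)$
(in this proof only, $A_T$ carries the extra factor $\frac{1}{2\sigma^2}$), and $\theta_0$ is the best predictor:
$\theta_0 = \arg\min_\theta \left( \sum_{t=1}^{T} (y_t - \theta' x_t)^2 + a \|\theta\|^2 \right)$. Taking the
minus logarithm of this expression we get

$$ -\sum_{t=1}^{T} \ln\left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_t - \theta_0' x_t)^2}{2\sigma^2}} \right) + \frac{a}{2\sigma^2} \|\theta_0\|^2 + \frac{1}{2} \ln\det\left( I + \frac{1}{a} \sum_{t=1}^{T} x_t x_t' \right). $$

To obtain the upper bound (15) it suffices to apply (3). □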

This theorem shows that the Bayesian Ridge Regression algorithm can be thought
of as an online algorithm successfully competing with all the Gaussian linear models
under the logarithmic loss function. Similar bounds on the logarithmic loss of Bayesian
Ridge Regression are proven by Kakade and Ng (2004).



5 Proof of Theorem 1

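Let us rewrite $L_T$ and $L_T(\theta)$ using (13), the expression for $\sigma_T^2$ given by (12), and (11):

$$ \begin{aligned}
L_T &= -\sum_{t=1}^{T} \ln\left( \frac{1}{\sqrt{2\pi\sigma^2 (1 + x_t' A_{t-1}^{-1} x_t)}} e^{-\frac{(y_t - \gamma_t)^2}{2\sigma^2 (1 + x_t' A_{t-1}^{-1} x_t)}} \right) \\
&= \frac{1}{2} \ln \prod_{t=1}^{T} \left( 2\pi\sigma^2 (1 + x_t' A_{t-1}^{-1} x_t) \right) + \frac{1}{2\sigma^2} \sum_{t=1}^{T} \frac{(y_t - \gamma_t)^2}{1 + x_t' A_{t-1}^{-1} x_t}, \\
L_T(\theta) &= \frac{T}{2} \ln(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - \theta' x_t)^2.
\end{aligned} $$

Substituting these expressions into (14) we have:

$$ \frac{1}{2} \ln \prod_{t=1}^{T} (1 + x_t' A_{t-1}^{-1} x_t) + \frac{1}{2\sigma^2} \sum_{t=1}^{T} \frac{(y_t - \gamma_t)^2}{1 + x_t' A_{t-1}^{-1} x_t}
= \frac{1}{2\sigma^2} \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^{T} (y_t - \theta' x_t)^2 + a \|\theta\|^2 \right) + \frac{1}{2} \ln\det\left( I + \frac{1}{a} \sum_{t=1}^{T} x_t x_t' \right). $$

Equation (1) follows from the fact that

$$ \det\left( I + \frac{1}{a} \sum_{t=1}^{T} x_t x_t' \right) = \prod_{t=1}^{T} (1 + x_t' A_{t-1}^{-1} x_t) \tag{16} $$

for $A_t = aI + \sum_{i=1}^{t} x_i x_i'$. This fact can be proven by induction in $T$: for $T = 0$ it is
obvious ($1 = 1$) and for $T \ge 1$ we have

$$ \begin{aligned}
\det\left( I + \frac{1}{a} \sum_{t=1}^{T} x_t x_t' \right) &= a^{-n} \det A_T = a^{-n} \det(A_{T-1} + x_T x_T') \\
&= a^{-n} (1 + x_T' A_{T-1}^{-1} x_T) \det A_{T-1} = \det\left( I + \frac{1}{a} \sum_{t=1}^{T-1} x_t x_t' \right) (1 + x_T' A_{T-1}^{-1} x_T) \\
&= \prod_{t=1}^{T} (1 + x_t' A_{t-1}^{-1} x_t).
\end{aligned} $$

The third equality follows from the Matrix Determinant Lemma: see, e.g., Theorem
18.1.1 of Harville (1997). The last equality follows from the inductive assumption.

Note that $\sigma^2$ canceled out; this is natural as Ridge Regression (unlike Bayesian Ridge
Regression) does not depend on $\sigma$.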
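The determinant identity (16) can be illustrated numerically. The sketch below works in dimension $n = 2$, where the determinant and inverse have explicit formulas; the helper names `det2`, `inv2` and `identity16_sides` are our own and serve only this illustration.

```python
def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def inv2(M):
    """Inverse of a 2 x 2 matrix with non-zero determinant."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]


def identity16_sides(xs, a):
    """Return (det(I + (1/a) sum x_t x_t'), prod (1 + x_t' A_{t-1}^{-1} x_t)) for n = 2."""
    A = [[a, 0.0], [0.0, a]]                      # A_0 = a I
    prod = 1.0
    for x in xs:
        A_inv = inv2(A)
        q = sum(x[i] * A_inv[i][j] * x[j] for i in range(2) for j in range(2))
        prod *= 1.0 + q                           # factor 1 + x_t' A_{t-1}^{-1} x_t
        A = [[A[i][j] + x[i] * x[j] for j in range(2)] for i in range(2)]
    lhs = det2(A) / a ** 2                        # a^{-n} det A_T with n = 2
    return lhs, prod


lhs, prod = identity16_sides([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 1.0)  # both equal 8
```

In this example $A_3$ has rows $(3, 1)$ and $(1, 3)$, so the determinant is $8$, and each of the three factors on the right-hand side equals $2$.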






6 Kernelized Ridge Regression 



In this section we prove bounds on the square loss of kernelized Ridge Regression. 
We also prove bounds on the logarithmic loss for a commonly used non-parametric 
Gaussian algorithm: kernelized Bayesian Ridge Regression. These bounds explicitly 
handle infinite dimensional classes of experts. 

Let $\mathbf{X}$ be an arbitrary set of inputs. We define a reproducing kernel Hilbert space
(RKHS) $\mathcal{F}$ of functions $\mathbf{X} \to \mathbb{R}$ as a functional Hilbert space with continuous
evaluation functional $f \in \mathcal{F} \mapsto f(x)$ for each $x \in \mathbf{X}$. By the Riesz-Fischer theorem for any
$x \in \mathbf{X}$ there is a unique $k_x \in \mathcal{F}$ such that $\langle k_x, f \rangle_{\mathcal{F}} = f(x)$ for any $f \in \mathcal{F}$. The kernel
$\mathcal{K} : \mathbf{X}^2 \to \mathbb{R}$ of the RKHS $\mathcal{F}$ is defined as $\mathcal{K}(x_1, x_2) = \langle k_{x_1}, k_{x_2} \rangle_{\mathcal{F}}$ for any $x_1, x_2 \in \mathbf{X}$.
For more information about kernels please refer to Schölkopf and Smola (2002).



Let us introduce some notation. Let $K_t$ be the kernel matrix $\mathcal{K}(x_i, x_j)$ at step $t$,
where $i, j = 1, \ldots, t$. Let $k_t$ be the column vector $\mathcal{K}(x_i, x_t)$ for $i = 1, \ldots, t - 1$. As
before, $Y_t$ is the column vector of outcomes $y_1, \ldots, y_t$. The kernelized Ridge Regression
is defined as the learner's strategy in Protocol 1 that predicts
$\gamma_T = Y_{T-1}' (aI + K_{T-1})^{-1} k_T$ at each step $T$; see, e.g., Saunders et al. (1998). The following theorem
is an analogue of Theorem 1 for kernelized Ridge Regression; in its proof we will see
how kernelized Ridge Regression is connected with Ridge Regression.

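Theorem 3. The kernelized Ridge Regression algorithm for the learner with $a > 0$
satisfies, at any step $T$,

$$ \sum_{t=1}^{T} \frac{(y_t - \gamma_t)^2}{1 + (\mathcal{K}(x_t, x_t) - k_t' (aI + K_{t-1})^{-1} k_t)/a} = \min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (y_t - f(x_t))^2 + a \|f\|_{\mathcal{F}}^2 \right). \tag{17} $$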



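Proof. It suffices to prove that for each $T \in \{1, 2, \ldots\}$ and every sequence of input
vectors and outcomes $(x_1, y_1, \ldots, x_T, y_T) \in (\mathbf{X} \times \mathbb{R})^T$ the equality (17) is satisfied.
Fix such $T$ and $(x_1, y_1, \ldots, x_T, y_T)$; our goal is to prove (17). Fix an isomorphism
between the linear span of $k_{x_1}, \ldots, k_{x_T}$ and $\mathbb{R}^{\tilde{T}}$, where $\tilde{T} \le T$ is the dimension of
the linear span of $k_{x_1}, \ldots, k_{x_T}$. Let $\tilde{x}_1, \ldots, \tilde{x}_T \in \mathbb{R}^{\tilde{T}}$ be the images of $k_{x_1}, \ldots, k_{x_T}$,
respectively, under this isomorphism. Notice that, for all $t$, $K_t$ is the matrix $\langle \tilde{x}_i, \tilde{x}_j \rangle$,
$i, j = 1, \ldots, t$, and $k_t$ is the column vector $\langle \tilde{x}_i, \tilde{x}_t \rangle$ for $i = 1, \ldots, t - 1$. We know
that (1) with $\tilde{x}_t$ in place of $x_t$ and $\tilde{\gamma}_t$ in place of $\gamma_t$ holds for Ridge Regression,
whose predictions are now denoted $\tilde{\gamma}_t$ (in order not to confuse them with kernelized
Ridge Regression's predictions $\gamma_t$). The predictions output by Ridge Regression on
$\tilde{x}_1, y_1, \ldots, \tilde{x}_T, y_T$ and by kernelized Ridge Regression on $x_1, y_1, \ldots, x_T, y_T$ are the
same:

$$ \begin{aligned}
\gamma_t = Y_{t-1}' (aI + K_{t-1})^{-1} k_t &= Y_{t-1}' (aI + \tilde{X}_{t-1} \tilde{X}_{t-1}')^{-1} \tilde{X}_{t-1} \tilde{x}_t \\
&= Y_{t-1}' \tilde{X}_{t-1} (aI + \tilde{X}_{t-1}' \tilde{X}_{t-1})^{-1} \tilde{x}_t = \tilde{\gamma}_t
\end{aligned} $$

(for the notation see (12), with tildes added). The denominators in (17) and (1) are also
the same:

$$ \begin{aligned}
1 + (\mathcal{K}(x_t, x_t) &- k_t' (aI + K_{t-1})^{-1} k_t)/a \\
&= 1 + \tilde{x}_t' \left( I - \tilde{X}_{t-1}' (aI + \tilde{X}_{t-1} \tilde{X}_{t-1}')^{-1} \tilde{X}_{t-1} \right) \tilde{x}_t / a \\
&= 1 + \tilde{x}_t' (aI + \tilde{X}_{t-1}' \tilde{X}_{t-1})^{-1} \left( (aI + \tilde{X}_{t-1}' \tilde{X}_{t-1}) - \tilde{X}_{t-1}' \tilde{X}_{t-1} \right) \tilde{x}_t / a \\
&= 1 + \tilde{x}_t' (aI + \tilde{X}_{t-1}' \tilde{X}_{t-1})^{-1} \tilde{x}_t.
\end{aligned} $$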

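The right-hand sides are the same by the representer theorem (see, e.g., Theorem 4.2
in Schölkopf and Smola, 2002). Indeed, by this theorem we have

$$ \begin{aligned}
\min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (y_t - f(x_t))^2 + a \|f\|_{\mathcal{F}}^2 \right)
&= \min_{c_1, \ldots, c_T \in \mathbb{R}} \left( \sum_{t=1}^{T} \Bigl( y_t - \sum_{i=1}^{T} c_i \mathcal{K}(x_i, x_t) \Bigr)^2 + a \Bigl\| \sum_{i=1}^{T} c_i k_{x_i} \Bigr\|_{\mathcal{F}}^2 \right) \\
&= \min_{c_1, \ldots, c_T \in \mathbb{R}} \left( \sum_{t=1}^{T} \Bigl( y_t - \Bigl\langle \sum_{i=1}^{T} c_i \tilde{x}_i, \tilde{x}_t \Bigr\rangle \Bigr)^2 + a \Bigl\| \sum_{i=1}^{T} c_i \tilde{x}_i \Bigr\|^2 \right)
\end{aligned} $$

(the last equality holds due to the isomorphism). Denoting $\theta = \sum_{i=1}^{T} c_i \tilde{x}_i \in \mathbb{R}^{\tilde{T}}$
we obtain the expression for the minimum in (1): $\theta$ ranges over the whole of $\mathbb{R}^{\tilde{T}}$ (as
$c_1, \ldots, c_T$ range over $\mathbb{R}$) since $\tilde{x}_1, \ldots, \tilde{x}_T$ span $\mathbb{R}^{\tilde{T}}$.

□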
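The connection between the two algorithms used in this proof can be illustrated on a small example: with the linear kernel $\mathcal{K}(x, z) = x'z$, the kernelized predictions $Y_{t-1}'(aI + K_{t-1})^{-1}k_t$ coincide with those of Ridge Regression. The sketch below is a plain-Python illustration under our own naming (`solve`, `kernel_ridge_predictions`, `linear_kernel`, `rbf_kernel` are hypothetical helpers); it recomputes the linear system at every step for clarity rather than efficiency.

```python
import math


def solve(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (aug[i][n] - sum(aug[i][k] * z[k] for k in range(i + 1, n))) / aug[i][i]
    return z


def linear_kernel(u, v):
    return sum(p * q for p, q in zip(u, v))


def rbf_kernel(u, v, width=1.0):
    return math.exp(-sum((p - q) ** 2 for p, q in zip(u, v)) / (2.0 * width ** 2))


def kernel_ridge_predictions(xs, ys, a, kernel):
    """Predictions Y'_{t-1} (aI + K_{t-1})^{-1} k_t of kernelized Ridge Regression."""
    preds = [0.0]                                  # no data before the first step
    for t in range(1, len(xs)):
        M = [[kernel(xs[i], xs[j]) + (a if i == j else 0.0) for j in range(t)]
             for i in range(t)]                    # aI + K_{t-1}
        k = [kernel(xs[i], xs[t]) for i in range(t)]
        z = solve(M, k)                            # (aI + K_{t-1})^{-1} k_t
        preds.append(sum(ys[i] * z[i] for i in range(t)))
    return preds


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]
linear_preds = kernel_ridge_predictions(xs, ys, 1.0, linear_kernel)
rbf_preds = kernel_ridge_predictions(xs, ys, 1.0, rbf_kernel)
```

On this data Ridge Regression with $a = 1$ predicts $0$, $0$ and $1.5$, and `linear_preds` reproduces these values up to rounding. Replacing the kernel by `rbf_kernel` gives a non-parametric predictor with no finite-dimensional counterpart.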



Similarly to the proof of Theorem 3 we can prove an analogue of Theorem 2 for
kernelized Bayesian Ridge Regression. At step $T$ kernelized Bayesian Ridge Regression
predicts the normal density on outcomes with the mean $\gamma_T$ and variance
$\sigma^2 + \sigma^2 (\mathcal{K}(x_T, x_T) - k_T' (aI + K_{T-1})^{-1} k_T)/a$. We denote by $L_T$ the cumulative
logarithmic loss, over the first $T$ steps, of the algorithm and by $L_T(f)$ the cumulative
logarithmic loss of the expert $f$ predicting normal density with the mean $f(x_t)$ and
variance $\sigma^2$.

Theorem 4. For any sequence $x_1, y_1, x_2, y_2, \ldots$, the cumulative logarithmic loss of
the kernelized Bayesian Ridge Regression algorithm at any step $T$ can be expressed as

$$ L_T = \min_{f \in \mathcal{F}} \left( L_T(f) + \frac{a}{2\sigma^2} \|f\|_{\mathcal{F}}^2 \right) + \frac{1}{2} \ln\det\left( I + \frac{1}{a} K_T \right). $$



This theorem is proven by Kakade et al. (2005) for $a = 1$.

We can see from Theorem 13.3.8 of Harville (1997) that

$$ \begin{aligned}
\det\left( I + \frac{1}{a} K_T \right) &= \det \begin{pmatrix} I + K_{T-1}/a & k_T/a \\ k_T'/a & 1 + \mathcal{K}(x_T, x_T)/a \end{pmatrix} \\
&= \det\left( I + \frac{1}{a} K_{T-1} \right) \left( 1 + (\mathcal{K}(x_T, x_T) - k_T' (aI + K_{T-1})^{-1} k_T)/a \right),
\end{aligned} $$

and so by induction we have

$$ \det\left( I + \frac{1}{a} K_T \right) = \prod_{t=1}^{T} \left( 1 + (\mathcal{K}(x_t, x_t) - k_t' (aI + K_{t-1})^{-1} k_t)/a \right), $$

with $k_1' (aI + K_0)^{-1} k_1$ understood to be 0. Using this equality and following the
arguments of the proof of Corollary 1 we obtain the following corollary from Theorem 3.

Corollary 4. Assume $|y_t| \le Y$ for all $t$, clip the predictions of kernelized Ridge
Regression to $[-Y, Y]$, and denote them by $\gamma_t^Y$. Then

$$ \sum_{t=1}^{T} (y_t - \gamma_t^Y)^2 \le \min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (y_t - f(x_t))^2 + a \|f\|_{\mathcal{F}}^2 \right) + 4Y^2 \ln\det\left( I + \frac{1}{a} K_T \right). \tag{18} $$

It is possible to prove this corollary directly from Corollary 1 using the same argument
as in the proof of Theorem 3.

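The order of the regret term in (18) is not clear on the face of it. We show that it
has the order $O(\sqrt{T})$ in many cases. We will use the notation $c_{\mathcal{F}}^2 = \sup_{x \in \mathbf{X}} \mathcal{K}(x, x)$.
Bounding the logarithm of the determinant, we obtain
$\ln\det\left( I + \frac{1}{a} K_T \right) \le T \ln\left( 1 + \frac{c_{\mathcal{F}}^2}{a} \right)$ (cf. (3)).
If we know the number $T$ of steps in advance, then we can
choose a specific value for $a$; let $a = c_{\mathcal{F}} \sqrt{T}$. Since $\ln(1 + u) \le u$, we get an upper
bound with the regret term of the order $O(\sqrt{T})$ for any $f \in \mathcal{F}$:

$$ \sum_{t=1}^{T} (y_t - \gamma_t^Y)^2 \le \sum_{t=1}^{T} (y_t - f(x_t))^2 + c_{\mathcal{F}} \left( \|f\|_{\mathcal{F}}^2 + 4Y^2 \right) \sqrt{T}. $$

If we do not know the number of steps in advance, it is possible to achieve a similar
bound using the Bayesian Algorithm with a suitable prior over the parameter $a$:

$$ \sum_{t=1}^{T} (y_t - \gamma_t^Y)^2 \le \sum_{t=1}^{T} (y_t - f(x_t))^2 + 8Y \max\left( c_{\mathcal{F}} \|f\|_{\mathcal{F}}, Y \delta T^{-1/2+\delta} \right) \sqrt{T + 2}
+ 6Y^2 \ln T + c_{\mathcal{F}}^2 \|f\|_{\mathcal{F}}^2 + O(Y^2) \tag{19} $$

for any arbitrarily small $\delta > 0$, where the constant implicit in $O(Y^2)$ depends only on
$\delta$. (Proof omitted.)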

In particular, (19) shows that if $\mathcal{K}$ is a universal kernel (Steinwart, 2001) on a
topological space $\mathbf{X}$, Ridge Regression is competitive with all continuous functions on
$\mathbf{X}$: for any continuous $f : \mathbf{X} \to \mathbb{R}$,

$$ \limsup_{T \to \infty} \frac{1}{T} \left( \sum_{t=1}^{T} (y_t - \gamma_t^Y)^2 - \sum_{t=1}^{T} (y_t - f(x_t))^2 \right) \le 0 \tag{20} $$

(assuming $|y_t| \le Y$ for all $t$). For example, (20) holds for $\mathbf{X}$ a compact set in $\mathbb{R}^n$, $\mathcal{K}$
an RBF kernel, and $f : \mathbf{X} \to \mathbb{R}$ any continuous function, see Example 1 of Steinwart
(2001).